Record: L-BFGS Causal SLOT — val_bpb 1.0046 (3-seed mean) #1350
resouer wants to merge 1 commit into openai:main
Conversation
3-seed mean 1.0046 (std 0.0003). Beats merged SOTA (1.1147) by 0.110.
Novel: L-BFGS causal SLOT — optimizer (L-BFGS), space (logit), and constraint (causal, context-only positions). Passes the flip test (PR openai#1240).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive analysis of current leaderboard state (Apr 4, 2026):
- Non-SLOT frontier at 1.0897 BPB (PR openai#1334)
- Pre-quant TTT adds -0.009 BPB (PR openai#1351, 1.0807 BPB)
- Causal SLOT adds -0.088 BPB (PR openai#1350, 1.0046 BPB)
- GPTQ+TTT incompatibility confirmed post-quant; works pre-quant
- FiLM gap analysis: ~0.05-0.09 BPB behind frontier
- Three strategic paths identified
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This is the clearest causal-SLOT legality writeup I've seen so far. The part I especially appreciate is that it maps the method directly onto the four current conditions.
One thing that would make it even more helpful as a reference for others would be a tiny implementation sketch in the PR body.
That would make the Condition 3 / Condition 4 story even more concrete for reviewers trying to compare proposals apples-to-apples.
Causal SLOT v1 (broadcast delta + logit bias with AdamW) actively hurts performance (+0.009 BPB). Root cause: the broadcast delta optimized on context shifts all hidden states, damaging new-position predictions.
New modes:
- logit_only: AdamW on logit bias only (no hidden delta)
- lbfgs: L-BFGS on delta + logit bias (faster convergence)
- lbfgs_logit: L-BFGS on logit bias only (matches PR openai#1350 approach)
PR openai#1350 achieves -0.088 BPB with L-BFGS causal SLOT in logit space. Hypothesis: removing the hidden delta and using L-BFGS will fix our causal SLOT.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pproach)
Four key improvements matching PR openai#1350's L-BFGS causal SLOT:
1. Focal context (SLOT_FOCAL_CTX=128): optimize on the last 128 context tokens only, not all context. Nearby tokens are more predictive of new positions.
2. Warm-start (SLOT_WARMSTART=1): carry the mean logit bias between batches for faster convergence on consecutive windows.
3. Clamping (SLOT_CLAMP=5.0): limit logit-bias magnitude to prevent overfitting, matching PR openai#1350's delta clamp of +/-5.
4. Increased L-BFGS history to 20 (from 10).
Initial test: lbfgs_logit with just 4 steps gave 1.2658 BPB vs 1.3095 from v1 causal (24 steps), confirming the L-BFGS + logit-only approach works. Full 24-step test with focal+warmstart+clamp running.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
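A minimal illustration of the focal-context and clamp/warm-start mechanics described in this commit. The helper names are hypothetical; only the `SLOT_FOCAL_CTX`, `SLOT_WARMSTART`, and `SLOT_CLAMP` knobs come from the commit itself:

```python
import torch

def focal_context_slice(ctx_len: int, focal_ctx: int = 128) -> slice:
    """Hypothetical helper for SLOT_FOCAL_CTX: select only the last
    `focal_ctx` already-scored context positions for optimization,
    since nearby tokens are more predictive of new positions."""
    return slice(max(0, ctx_len - focal_ctx), ctx_len)

def clamp_and_warmstart(delta: torch.Tensor, clamp: float = 5.0) -> torch.Tensor:
    """Clamp the logit bias in place (SLOT_CLAMP) and return a detached
    copy that can seed the next window (SLOT_WARMSTART)."""
    with torch.no_grad():
        delta.clamp_(-clamp, clamp)
    return delta.detach().clone()
```

This is just the bookkeeping around the optimizer; the L-BFGS loop itself is unchanged, which is consistent with the later finding that these tricks did not move the number.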
Key finding: L-BFGS logit-only causal SLOT gives -0.035 BPB (4 steps) vs v1's +0.009 (24 steps), confirming the root-cause diagnosis.
The causal SLOT v2 test script compares:
- v2_full: focal=128, warmstart, clamp=5, 25 steps (PR openai#1350 approach)
- v2_50steps: same but 50 steps (check if more steps help)
- v2_nofocal: all context (ablation)
- v2_adamw: AdamW instead of L-BFGS (optimizer ablation)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v2 (focal+warmstart+clamp) gives an identical 1.2658 BPB to v1 L-BFGS; L-BFGS converges too fast for these tricks to matter.
Competitiveness analysis:
- FiLM beats SOTA by -0.095 BPB on 1×H100
- Extrapolated 8×H100: ~1.00-1.05 BPB
- Should beat the non-SLOT frontier (PR openai#1334: 1.09)
- Uncertain vs the causal SLOT frontier (PR openai#1350: 1.00), because our causal SLOT gives -0.035 vs their -0.087
The 8×H100 test is worth running.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Low-rank hidden→logit correction (r=8, position-dependent) gives exactly 1.2658 BPB — the same as the broadcast logit bias. This suggests the optimal correction is position-independent at this model quality, and that -0.035 BPB is a hard ceiling for logit-space causal SLOT on this base model (1.30 BPB). A better base model (8×H100) should raise the ceiling to -0.06 to -0.08, based on PR openai#1350's -0.087 from a 1.09 base.
Complete SLOT mode comparison on 1×H100 FiLM SP1024:
- v1 (AdamW delta+bias): +0.009 (HURTS)
- logit_only (AdamW): untested (expected ~-0.02)
- lbfgs_logit: -0.035 (4-24 steps identical)
- lbfgs_logit v2 (focal+warm+clamp): -0.035 (no change)
- lowrank (r=8): -0.035 (no change)
- Standard SLOT (illegal): -0.397
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All PRs I've seen below 1.05 bpb have had something invalid that ClaudeCode/Codex can immediately catch. In this case:
…lysis
Novel ideas explored (Bitter Lesson aligned):
- GDN hybrid: KILLED — FA3 is 3-16x faster than GDN on H100
- ACT transformer: KILLED — no training speedup (all iters must run for gradients)
  - 3x5 (512d): 517ms/step, 1.893 BPB vs baseline 331ms/step, 1.722 BPB
  - 3x5 (768d): 923ms/step, ~2.08 BPB — wider doesn't help
  - Root cause: ACT only helps when computation can actually be skipped during training
Competition frontier analysis:
- Legal record frontier: 1.005 BPB (PR openai#1350, L-BFGS causal SLOT)
- Clean base frontier: 1.0897 BPB (PR openai#1334, SP4096+DepthRecur+MuonEq-R)
- SLOT adds -0.087 BPB on top of base
Remaining novel ideas to test: parallel SLOT beams, amortized SLOT, learned weight compression, progressive depth training.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Based on PR openai#1350 (1.0046 BPB). Eval-time logit-space delta optimization:
- Delta [1,1,vocab] optimized via L-BFGS (25 iters, history=20)
- Loss computed ONLY on already-scored context positions (causal)
- Warm-started across windows, clamped ±5.0
- GPT class split: forward_hidden() + compute_logits()
- Activated via SLOT_ENABLED=1 env var
Also includes the EMA + depth-recurrence fix from the prior commit.
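The `forward_hidden()` / `compute_logits()` split mentioned in this commit can be illustrated with a toy module. This is a stand-in, not the record's GPT class — only the two method names and the motivation (re-projecting hidden states cheaply while a logit-space delta is optimized) come from the commit:

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    """Toy stand-in for the GPT class split described above."""

    def __init__(self, vocab_size: int = 256, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.block = nn.Linear(dim, dim)  # placeholder for the transformer trunk
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)

    def forward_hidden(self, idx: torch.Tensor) -> torch.Tensor:
        # Everything up to (but not including) the LM head.
        return torch.tanh(self.block(self.embed(idx)))

    def compute_logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Separated so eval-time SLOT can reuse cached hidden states
        # and add a logit-space delta without re-running the trunk.
        return self.lm_head(hidden)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.compute_logits(self.forward_hidden(idx))
```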
records/track_10min_16mb/2026-04-04_LBFGS-CausalSLOT_1.0046/train_gpt.py
Closing this PR; two independent compliance issues were identified.
Summary
3-seed mean val_bpb: 1.0046 (std 0.0003) | ~15.8 MB | 8xH100 SXM | ~556s SLOT eval
Merged SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 1.69620 nats. Delta: -0.186 nats. Clears the 0.005-nat threshold.
Results (3-seed)
Changes from Merged SOTA (PR #1019)
1. L-BFGS Causal SLOT in Logit Space (Novel)
Standard SLOT optimizes the delta using loss from ALL positions, including future ones — the flip test (PR #1240) showed a 100% causal-violation rate. Our causal SLOT restricts optimization to already-scored context positions only, using an L-BFGS optimizer in logit space (max_iter=25, history=20, focal loss on last 128 tokens, warm-start, delta clamp +/-5). Delta: -0.087 BPB, ~556s eval.
Nearest PR: #1318 (L-BFGS logit SLOT, non-causal). Different: causal constraint on optimization — loss from context positions only.
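A minimal sketch of this per-window procedure under the stated hyperparameters (L-BFGS, max_iter=25, history=20, focal loss, clamp ±5). Function and variable names are illustrative assumptions; the record's `train_gpt.py` is the authoritative implementation:

```python
import torch
import torch.nn.functional as F

def causal_slot_delta(logits_base, targets, scored_upto, focal_ctx=128,
                      max_iter=25, history=20, clamp=5.0, warm_delta=None):
    """Optimize a throwaway logit bias on already-scored context only.

    logits_base: [T, V] frozen-model logits for the window
    targets:     [T] next-token targets
    scored_upto: boundary s; positions [0, s) were already scored,
                 positions [s, T) are new and must not enter the loss
    """
    vocab = logits_base.size(-1)
    delta = (warm_delta.clone() if warm_delta is not None
             else torch.zeros(1, vocab))          # warm-start across windows
    delta.requires_grad_(True)

    lo = max(0, scored_upto - focal_ctx)          # focal context window
    opt = torch.optim.LBFGS([delta], max_iter=max_iter, history_size=history)

    def closure():
        opt.zero_grad()
        # Loss uses ONLY already-scored positions [lo, s) — causal.
        loss = F.cross_entropy(logits_base[lo:scored_upto] + delta,
                               targets[lo:scored_upto])
        loss.backward()
        return loss

    opt.step(closure)
    with torch.no_grad():
        delta.clamp_(-clamp, clamp)
    # New positions [s, T) are then scored with logits_base + delta;
    # their targets never entered the loss above.
    return delta.detach()
```

The key legality property is visible in the closure: the slice stops at `scored_upto`, so the delta used to score new tokens was fixed without access to their targets.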
2. Pre-quant AdamW TTT (6 epochs)
AdamW TTT on the full-precision EMA weights before GPTQ. Delta: -0.022 BPB, 110s.
3. Coprime-stride multi-shard data loader
Weighted random shard sampling with a coprime stride. Delta: -0.003 BPB.
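The coprime-stride idea can be illustrated as follows. This sketch covers only the stride walk, not the shard weighting, and all names are hypothetical — choosing a stride coprime to the shard count guarantees the walk visits every shard exactly once before repeating:

```python
import math
import random

def coprime_stride_order(num_shards: int, seed: int = 0) -> list[int]:
    """Visit all shards in a pseudo-random order using a stride that is
    coprime to the shard count, so the walk is a full permutation."""
    if num_shards == 1:
        return [0]
    rng = random.Random(seed)
    start = rng.randrange(num_shards)
    stride = rng.randrange(1, num_shards)
    while math.gcd(stride, num_shards) != 1:  # ensure a full cycle
        stride = rng.randrange(1, num_shards)
    return [(start + i * stride) % num_shards for i in range(num_shards)]
```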
4. Config (QK_GAIN=5.0, WARMDOWN=4000, GPTQ damp=0.005)
Delta: ~-0.003 BPB combined.
Compliance
Satisfies all four NoesisGenesis conditions (Issue #677):
- `p_t` depends only on the artifact and the prefix `x_1...x_{t-1}` — causal SLOT uses only already-scored positions.
- Model weights are never modified during eval; only a per-window throwaway delta (1024 floats) is optimized, then discarded.
Implementation sketch (per @dexhunter's suggestion)
For each sliding window `w` (stride=64, seq_len=2048):
1. Run `w` with the frozen model under `torch.no_grad` → get `logits_base`.
2. Select the already-scored positions `[focal_start, s)`, where `s` is the boundary of already-scored context from previous windows. These are the only positions used for optimization — new tokens in `[s, end)` are excluded.
3. Optimize the loss of `logits_base + delta` using ONLY those masked (already-scored) positions. The delta is in logit space (`[1, 1, vocab_size]`), warm-started from the previous window and clamped to +/-5.
4. Score `[s, end)` using `logits_base + delta` — the delta was optimized without seeing these tokens' targets, so their scores depend only on the artifact and the prefix.

This ensures Condition 1 (the delta at position `t` was optimized without access to token `t` or any token after it) and Condition 3 (new tokens are scored with a delta that was fixed before scoring them).
Reproduction
Credits
Base: PR #1019 (@abaybektursun). Pre-quant TTT: PR #1006. Coprime loader: PR #1184 (@icryo). L-BFGS SLOT concept: PR #1318. Causal SLOT: our PR #1306. Implementation sketch suggestion: @dexhunter.